Liting Cui

Univariate Plots Section

'data.frame':   4898 obs. of  13 variables:
 $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
       X        fixed.acidity    volatile.acidity  citric.acid    
 Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
 1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
 Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
 Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
 3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
 Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
 residual.sugar     chlorides       free.sulfur.dioxide
 Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
 1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
 Median : 5.200   Median :0.04300   Median : 34.00     
 Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
 3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
 Max.   :65.800   Max.   :0.34600   Max.   :289.00     
 total.sulfur.dioxide    density             pH          sulphates     
 Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
 1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
 Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
 Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
 3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
 Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
    alcohol         quality     
 Min.   : 8.00   Min.   :3.000  
 1st Qu.: 9.50   1st Qu.:5.000  
 Median :10.40   Median :6.000  
 Mean   :10.51   Mean   :5.878  
 3rd Qu.:11.40   3rd Qu.:6.000  
 Max.   :14.20   Max.   :9.000  

This dataset contains 4898 observations of 12 variables, plus a row index column X (13 columns in total).


   3    4    5    6    7    8    9 
  20  163 1457 2198  880  175    5 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   5.000   6.000   5.878   6.000   9.000 

In this dataset, most samples are rated 5 or 6, which are normal wines. The mean wine rating is 5.878. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.
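
The rating summary can be reproduced from the counts alone. The sketch below rebuilds the rating vector from the printed table rather than reading the original data file, so `quality` here is only a stand-in for `winedf$quality`, and `share_5_or_6` is a name introduced for illustration.

```r
# Quality counts copied from the table above (ratings 3 through 9).
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Rebuild the rating vector from the counts; a stand-in for winedf$quality.
quality <- rep(as.integer(names(counts)), times = counts)

table(quality)     # frequency of each rating
summary(quality)   # min, quartiles, median, mean, max

# Fraction of samples rated 5 or 6 (illustrative name).
share_5_or_6 <- mean(quality %in% c(5, 6))
```

Working from the counts keeps the example independent of the file location, which the report does not state.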

I selected four variables that I am interested in to plot: volatile acidity, alcohol, chlorides and density.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9871  0.9917  0.9937  0.9940  0.9961  1.0390 

1) The density values are concentrated between 0.9871 and 1, and the distribution appears normal.
2) The alcohol values range from 8% to about 14%, but most wine samples have an alcohol content between 9% and 13%.
3) The chlorides values are concentrated between 0.01 and 0.1, but this variable seems to have many outliers on the right side. The boxplot confirms this. To make the distribution appear more normal, a log10() transform was applied.
4) For volatile acidity, most values are between 0.2 and 0.3. However, the distribution has a relatively long tail on the right side. There are 170 samples containing more than 0.5 g / dm^3 of volatile acidity. I wonder whether wines with high volatile acidity tend to be rated lower, since too high a level of acetic acid can lead to an unpleasant, vinegar taste.
5) Like volatile acidity and chlorides, the distribution of residual sugar is strongly right-skewed. After a log transformation the distribution appears bimodal.
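
To illustrate why log10() helps for chlorides, the sketch below uses only the five-number summary printed earlier, not the full data. `tail_ratio` is a hypothetical helper introduced here: it compares the distance from the median to the maximum with the distance from the median to the minimum, so a value near 1 suggests the extremes are roughly symmetric.

```r
# Five-number summary of chlorides, copied from the summary() output above.
chl <- c(min = 0.009, q1 = 0.036, median = 0.043, q3 = 0.050, max = 0.346)

# Hypothetical helper: upper-tail length divided by lower-tail length.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

ratio_raw <- tail_ratio(chl)          # raw scale: upper tail is much longer
ratio_log <- tail_ratio(log10(chl))   # log10 scale: tails are closer in length

c(raw = ratio_raw, log10 = ratio_log)
```

This is a rough check on the extremes only; a histogram of the transformed column remains the better diagnostic.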

I also created a new variable, ‘perfree’, to represent the percent of free SO2 in total SO2. I am curious whether this new variable is correlated with wine quality.
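
A minimal sketch of how such a variable can be derived, using only the first five rows of the two sulfur dioxide columns as printed in the str() output. `wine_head` is an illustrative name; the report's own data frame appears to be called `winedf`. Note that the result is a fraction between 0 and 1 rather than a value on a 0 to 100 scale.

```r
# First five rows of the two SO2 columns, copied from the str() output.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Share of free SO2 in total SO2.
wine_head$perfree <- wine_head$free.sulfur.dioxide /
  wine_head$total.sulfur.dioxide

round(wine_head$perfree, 3)
```

The rounded values agree with the first entries of the perfree column in the str() listing of the bivariate section.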

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine test samples in the dataset with 12 features, which are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are wine quality and alcohol content. I want to investigate which features best predict wine quality. According to my online research, alcohol content is discussed a lot by wine experts as well as ordinary consumers, so I am interested in how the alcohol level affects wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In addition to alcohol, I think other features such as wine density, volatile acidity and chlorides can also affect wine quality. For example, as the data documentation mentions, too high a level of acetic acid in wine can lead to an unpleasant, vinegar taste.

Did you create any new variables from existing variables in the dataset?

A new variable, perfree, was created to represent the percent of free sulfur dioxide in total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The chlorides distribution is right-skewed, so I log-transformed this variable. After the transformation, the chlorides distribution looks more normal.

Bivariate Plots Section

I want to see how correlated the different variables are.

'data.frame':   4898 obs. of  13 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
 $ perfree             : num  0.265 0.106 0.309 0.253 0.253 ...

According to the plot matrix, fixed acidity, citric acid, residual sugar, free sulfur dioxide, pH and sulphates do not seem to have strong correlations with wine quality. Alcohol, density, chlorides and the percent of free SO2 in total SO2 are moderately correlated with wine quality.

My goal is to investigate which attributes have the biggest impact on wine quality and how they affect it, but before analysing attributes against quality, I wanted to look at how the feature attributes are correlated with each other. I selected two pairs of variables that are highly correlated: density and alcohol, and density and sugar.

This plot shows a clear relation between density and alcohol. As the density of the wine increases, the alcohol content decreases.

Density and residual sugar are strongly and positively related, which makes sense because wine density depends mainly on sugar and alcohol content.
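
The two relations above can be checked with cor(). The sketch below only demonstrates the call on the first five rows shown in the str() output, where density is printed to three decimals, so the coefficients it produces are not the ones from the full data; only their signs are meaningful here. `wine_head` is an illustrative name.

```r
# First five rows of three columns, copied from the str() output.
# Density is shown rounded to three decimals there.
wine_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation of density with each of the other two columns.
cor_density_sugar   <- cor(wine_head$density, wine_head$residual.sugar)
cor_density_alcohol <- cor(wine_head$density, wine_head$alcohol)

c(sugar = cor_density_sugar, alcohol = cor_density_alcohol)
```

On the full data frame the same calls would be applied to the complete columns instead of this five-row excerpt.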

Next I am going to plot wine quality against alcohol content and density, since these two attributes have the highest correlation coefficients with wine quality.

winedf$quality: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00    9.55   10.45   10.34   11.00   12.60 
-------------------------------------------------------- 
winedf$quality: 4
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.40    9.40   10.10   10.15   10.75   13.50 
-------------------------------------------------------- 
winedf$quality: 5
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.000   9.200   9.500   9.809  10.300  13.600 
-------------------------------------------------------- 
winedf$quality: 6
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.50    9.60   10.50   10.58   11.40   14.00 
-------------------------------------------------------- 
winedf$quality: 7
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.60   10.60   11.40   11.37   12.30   14.20 
-------------------------------------------------------- 
winedf$quality: 8
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.50   11.00   12.00   11.64   12.60   14.00 
-------------------------------------------------------- 
winedf$quality: 9
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   12.40   12.50   12.18   12.70   12.90 

Alcohol content has the strongest correlation with quality in the correlation matrix. As the point plot and boxplot show, wines rated 5 have the lowest median alcohol content (9.5%). For wines rated above the mean quality, however, quality tends to improve as alcohol increases.
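
The pattern in the grouped summaries can be verified from the medians alone. The sketch below copies the median alcohol content for each rating from the output above; the grouped output itself looks like the result of a call such as `by(winedf$alcohol, winedf$quality, summary)`, which is an assumption based on its format. The names `lowest` and `rises_from_5` are introduced for illustration.

```r
# Median alcohol content per quality rating, copied from the output above.
alcohol_median <- c(`3` = 10.45, `4` = 10.10, `5` = 9.50, `6` = 10.50,
                    `7` = 11.40, `8` = 12.00, `9` = 12.50)

# Rating with the lowest median alcohol content.
lowest <- names(which.min(alcohol_median))

# Does the median increase at every step from rating 5 up to rating 9?
from_5_up    <- alcohol_median[as.character(5:9)]
rises_from_5 <- all(diff(from_5_up) > 0)

lowest
rises_from_5
```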

winedf$quality: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
-------------------------------------------------------- 
winedf$quality: 4
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
-------------------------------------------------------- 
winedf$quality: 5
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
-------------------------------------------------------- 
winedf$quality: 6
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
-------------------------------------------------------- 
winedf$quality: 7
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
-------------------------------------------------------- 
winedf$quality: 8
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
-------------------------------------------------------- 
winedf$quality: 9
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9896  0.9898  0.9903  0.9915  0.9906  0.9970 

Density seems to have a negative relation with wine quality. As the plot shows, better wines usually have lower density.

The percent of free SO2 in total SO2 shows a trend across quality levels: better wines tend to have a higher percent of free SO2. As the plot shows, wines rated above the median quality of 6 tend to have a free SO2 percent above the overall median. However, compared with alcohol and density, the difference between quality groups is less evident.

[1] -0.2728567
[1] -0.2099344

In general, wine quality increases as chlorides decrease. However, as the boxplot shows, there are many chlorides outliers among wines rated 5 and 6. In addition, after chlorides is log-transformed, the absolute correlation coefficient between chlorides and wine quality increases from 0.21 to 0.27.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Among all the investigated variables, wine quality is most strongly related to alcohol content, with a correlation coefficient of 0.44. For wines rated above 5, quality tends to improve as alcohol content increases.

In addition to alcohol content, the wine quality is also highly related to wine density. The better wine usually has lower density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a strong relation between wine density and residual sugar, which is expected since the data documentation mentions that the density of wine depends on the percent of alcohol and the sugar content. Wine density increases as residual sugar content increases. In addition, wine density and alcohol content are strongly and negatively correlated: density clearly decreases as alcohol content increases. This strong relation concerns me since I am planning to incorporate both density and alcohol content into a predictive model, which could introduce a multicollinearity issue.

What was the strongest relationship you found?

The alcohol content is moderately and positively correlated with wine quality. The density of wine also correlates with wine quality, but less than alcohol content.

Alcohol content and wine density are highly correlated with each other, which may cause a multicollinearity issue when building a predictive model with those two attributes.
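
One common way to quantify this concern is the variance inflation factor (VIF). For a model with exactly two predictors whose pairwise correlation is r, the VIF of either predictor is 1 / (1 - r^2). The sketch below is not part of the original analysis: `vif_two_predictors` is a hypothetical helper, and the r values are illustrative inputs, not correlations measured on the wine data.

```r
# VIF for a two-predictor model with pairwise correlation r.
# Hypothetical helper; valid only for exactly two predictors.
vif_two_predictors <- function(r) {
  stopifnot(abs(r) < 1)
  1 / (1 - r^2)
}

# Illustrative correlations, NOT values measured on the wine data.
r_values <- c(0, 0.5, 0.9)
vifs <- sapply(r_values, vif_two_predictors)

data.frame(r = r_values, vif = round(vifs, 2))
```

The VIF grows quickly as |r| approaches 1, which is why a strongly correlated pair such as density and alcohol deserves a check before both enter a model.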

Multivariate Plots Section

More of the higher-quality wine samples (rating > 6) are located in the upper left section than in the other three sections, which is consistent with my previous analysis that wine quality is positively related to alcohol but negatively related to density.

Most of the higher-quality sample points cluster in the low-chlorides, low-density section of this point plot.

Among all those variables, alcohol content is most strongly related to wine quality, but chlorides can also explain some of the variation. As the plot above shows, holding alcohol constant, most of the high-quality wine samples are below the mean line of chlorides values.

It’s hard to see a pattern in the distribution of wine quality along the perfree variable (percent of free SO2 in total SO2). With alcohol held constant, the high-quality wines seem evenly distributed along the Y axis.

winedf$quality: 3
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
-------------------------------------------------------- 
winedf$quality: 4
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
-------------------------------------------------------- 
winedf$quality: 5
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.100   0.240   0.280   0.302   0.340   0.905 
-------------------------------------------------------- 
winedf$quality: 6
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
-------------------------------------------------------- 
winedf$quality: 7
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
-------------------------------------------------------- 
winedf$quality: 8
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
-------------------------------------------------------- 
winedf$quality: 9
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.240   0.260   0.270   0.298   0.360   0.360 

No strong correlation was observed between wine quality and volatile acidity. High-quality wines seem evenly distributed along the volatile acidity axis, which surprises me since a high level of acetic acid is said to lead to an unpleasant taste.
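
The grouped summaries above support this reading. The sketch below copies the median volatile acidity for each rating from that output and checks two things: how far apart the medians are, and whether they move in one direction as the rating rises. The names `spread` and `is_monotone` are introduced for illustration.

```r
# Median volatile acidity per quality rating, copied from the output above.
va_median <- c(`3` = 0.26, `4` = 0.32, `5` = 0.28, `6` = 0.25,
               `7` = 0.25, `8` = 0.26, `9` = 0.27)

# Gap between the largest and smallest group median.
spread <- diff(range(va_median))

# TRUE only if the medians never change direction across ratings.
steps       <- diff(va_median)
is_monotone <- all(steps >= 0) || all(steps <= 0)

spread
is_monotone
```

The medians differ by less than 0.1 g / dm^3 and change direction across ratings, which fits the absence of a clear trend.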

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol content is most strongly correlated with wine quality, but other variables also contribute to the variation in quality. For instance, holding alcohol constant, most of the high-quality wine samples have chlorides below the mean.

Were there any interesting or surprising interactions between features?

I was expecting volatile acidity to have an impact on wine quality, since the data documentation mentions that too high a level of acetic acid in wine can lead to an unpleasant, vinegar taste. However, there is no obvious pattern in the distribution of wine quality along volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One


   3    4    5    6    7    8    9 
  20  163 1457 2198  880  175    5 

Description One

Most wine samples are rated 5 or 6, which are normal wines. The mean wine rating is 5.878. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.

Plot Two

Description Two

Alcohol content has the strongest correlation with wine quality in the correlation matrix. As the point plot and boxplot show, wines rated 5 have the lowest median alcohol content. For wines rated above the mean quality, however, quality tends to improve as alcohol increases.

Plot Three

Description Three

More of the higher-quality wine samples (rating > 6) are located in the lower left section than in the other sections, which means quality is negatively related to wine density and chlorides content.


Reflection

This white wine dataset contains 4898 samples, with 12 attributes measured. Most samples are rated 5 or 6, which are normal wines. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.

Alcohol content is discussed a lot by wine experts as well as ordinary consumers, so my initial interest was to explore the relation between wine quality and alcohol content. By calculating the correlation coefficient and plotting quality against alcohol, I found that wine quality is moderately correlated with alcohol content and that better wines usually have higher alcohol content.

Obviously, alcohol content is not the only factor determining wine quality, so I explored other variables that could affect it: density, chlorides, volatile acidity and the percent of free SO2 in total SO2. I could see a trend between quality and both density and chlorides, but I was surprised that there is no obvious pattern in the distribution of quality against volatile acidity. I had expected wines with a high level of volatile acidity to be rated lower, since too high a level could lead to an unpleasant taste.

According to the correlation matrix, both alcohol and density are correlated with wine quality, but these two variables are also highly correlated with each other. If a predictive model is built in the future, incorporating both variables could introduce a multicollinearity issue, so further analysis is needed before using them together in a model.